Prof Nathan Taback will drop in to answer questions you might have.
There are three important parts of the ISSC (well, four, if you count the most important part: YOU!).
Your mission, should you choose to accept it, is to complete a mini-data visualisation challenge by the end of the day.
You can definitely do this challenge even if you haven’t sorted out your GitHub yet, but I’d strongly recommend making this one of your ISSC goals. There is more information in the first 6 Sigma Sunday newsletter. You may wish to create a repository or folder called ‘ISSC’ or ‘TidyTuesday’ to store this mini-project in. I have one called ‘ISSC’ with the files from this AND the two previous TidyTuesday & Talks.
Or it could be an R script, but I prefer RMDs (R Markdown files), which is what this is written in. An RMD is perfect for when you want your code, outputs and commentary to all be together.
If you haven’t installed tidyverse yet, you will need that package for today. It includes dplyr and ggplot2.
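A minimal sketch of the install-then-load step, assuming you are happy to install from your default CRAN mirror. The `requireNamespace()` check is my addition so the install only runs when the package is missing.

```r
# Install once per machine, only if tidyverse is missing
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load it every session; this attaches dplyr, ggplot2, tidyr and friends
library(tidyverse)
```

You only need the `library()` line at the top of your RMD once the package is installed.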
There is more than one way to get this data. I’m going to use the tidytuesdayR package because I installed it last week. Choose the way that is right for you from these options.
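Here is a sketch of two of those options wrapped in one small helper. It assumes this is the beach volleyball dataset released for TidyTuesday on 2020-05-19; the date, the URL built from it, and the helper name `load_vb_matches()` are my assumptions and do not come from this post.

```r
# Assumed location of the raw CSV in the TidyTuesday repository
vb_url <- paste0(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/",
  "master/data/2020/2020-05-19/vb_matches.csv"
)

# Hypothetical helper: pick whichever loading method suits you
load_vb_matches <- function(method = c("tidytuesdayR", "csv")) {
  method <- match.arg(method)
  if (method == "tidytuesdayR") {
    # needs install.packages("tidytuesdayR") first
    tidytuesdayR::tt_load("2020-05-19")$vb_matches
  } else {
    # guess_max makes read_csv inspect more rows before guessing column types
    readr::read_csv(vb_url, guess_max = 76000)
  }
}
```

Either `vb_matches <- load_vb_matches()` or `vb_matches <- load_vb_matches("csv")` should give you the same table; the CSV route avoids installing an extra package.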
glimpse(vb_matches)
Rows: 76,756
Columns: 65
$ circuit <chr> "AVP", "AVP", "AVP", "AVP", "AVP", "AVP", "AVP", "AVP", "AVP", "AVP", "…
$ tournament <chr> "Huntington Beach", "Huntington Beach", "Huntington Beach", "Huntington…
$ country <chr> "United States", "United States", "United States", "United States", "Un…
$ year <dbl> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002,…
$ date <date> 2002-05-24, 2002-05-24, 2002-05-24, 2002-05-24, 2002-05-24, 2002-05-24…
$ gender <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
$ match_num <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, …
$ w_player1 <chr> "Kevin Wong", "Brad Torsone", "Eduardo Bacil", "Brent Doble", "Albert H…
$ w_p1_birthdate <date> 1972-09-12, 1975-01-14, 1971-03-11, 1970-01-03, 1970-05-04, 1974-07-21…
$ w_p1_age <dbl> 29.69473, 27.35661, 31.20329, 32.38604, 32.05476, 27.84120, 31.47981, 2…
$ w_p1_hgt <dbl> 79, 78, 74, 78, 75, 75, 78, 77, 75, 79, 73, 79, 78, 77, 73, 77, 79, 78,…
$ w_p1_country <chr> "United States", "United States", "Brazil", "United States", "United St…
$ w_player2 <chr> "Stein Metzger", "Casey Jennings", "Fred Souza", "Karch Kiraly", "Jeff …
$ w_p2_birthdate <date> 1972-11-17, 1975-07-10, 1972-05-13, 1960-11-03, 1972-08-03, 1972-02-01…
$ w_p2_age <dbl> 29.51403, 26.87201, 30.02875, 41.55236, 29.80424, 30.30801, 28.25188, N…
$ w_p2_hgt <dbl> 75, 75, 79, 74, 80, 77, 78, 79, 75, 76, 76, 75, NA, 78, 73, 74, 75, 74,…
$ w_p2_country <chr> "United States", "United States", "Brazil", "United States", "United St…
$ w_rank <chr> "1", "16", "24", "8", "5", "12", "13", "4", "3", "14", "22", "6", "26",…
$ l_player1 <chr> "Chuck Moore", "Mark Paaluhi", "Adam Jewell", "David Swatik", "Adam Rob…
$ l_p1_birthdate <date> 1973-08-18, 1971-03-08, 1975-06-24, 1973-02-14, 1976-01-25, 1979-02-10…
$ l_p1_age <dbl> 28.76386, 31.21150, 26.91581, 29.27036, 26.32717, 23.28268, 30.23956, 2…
$ l_p1_hgt <dbl> 76, 75, 77, 76, 73, NA, 75, 75, 68, 75, 77, 74, 78, 73, 79, 73, 78, 74,…
$ l_p1_country <chr> "United States", "United States", "United States", "United States", "Un…
$ l_player2 <chr> "Ed Ratledge", "Nick Hannemann", "Collin Smith", "Mike Mattarocci", "Ji…
$ l_p2_birthdate <date> 1976-12-16, 1972-01-12, 1975-05-26, 1969-10-05, 1978-03-26, 1969-05-30…
$ l_p2_age <dbl> 25.43463, 30.36277, 26.99521, 32.63244, 24.16153, 32.98289, 30.13826, 2…
$ l_p2_hgt <dbl> 80, 78, 76, 80, 75, 76, 81, 77, 77, 74, 73, 73, 72, 73, 71, 78, 75, 79,…
$ l_p2_country <chr> "United States", "United States", "United States", "United States", "Un…
$ l_rank <chr> "32", "17", "9", "25", "28", "21", "20", "29", "30", "19", "11", "27", …
$ score <chr> "21-18, 21-12", "21-16, 17-21, 15-10", "21-18, 21-18", "21-16, 21-15", …
$ duration <time> 00:33:00, 00:57:00, 00:46:00, 00:44:00, 01:08:00, 00:55:00, 00:46:00, …
$ bracket <chr> "Winner's Bracket", "Winner's Bracket", "Winner's Bracket", "Winner's B…
$ round <chr> "Round 1", "Round 1", "Round 1", "Round 1", "Round 1", "Round 1", "Roun…
$ w_p1_tot_attacks <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ w_p1_tot_kills <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ w_p1_tot_errors <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ w_p1_tot_hitpct <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ w_p1_tot_aces <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 1, 2, 4, 0, 1, 0, 0, 2, 1, 1, 1, 1, 1, 0, 1, 1,…
$ w_p1_tot_serve_errors <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ w_p1_tot_blocks <dbl> 7, 4, 2, 3, 0, 0, 0, 0, 2, 3, 0, 3, 4, 0, 2, 1, 4, 7, 0, 0, 2, 0, 0, 4,…
$ w_p1_tot_digs <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ w_p2_tot_attacks <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ w_p2_tot_kills <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ w_p2_tot_errors <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ w_p2_tot_hitpct <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ w_p2_tot_aces <dbl> 2, 4, 0, 0, 0, 0, 0, 0, 0, 4, 0, 1, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 2,…
$ w_p2_tot_serve_errors <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ w_p2_tot_blocks <dbl> 0, 0, 4, 0, 6, 0, 0, 3, 3, 1, 5, 0, 0, 1, 0, 1, 0, 0, 1, 5, 2, 4, 5, 0,…
$ w_p2_tot_digs <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ l_p1_tot_attacks <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ l_p1_tot_kills <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ l_p1_tot_errors <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ l_p1_tot_hitpct <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ l_p1_tot_aces <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,…
$ l_p1_tot_serve_errors <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ l_p1_tot_blocks <dbl> 0, 2, 1, 2, 0, 0, 0, 0, 0, 1, 9, 1, 1, 1, 1, 0, 3, 0, 0, 3, 1, 2, 1, 0,…
$ l_p1_tot_digs <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ l_p2_tot_attacks <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ l_p2_tot_kills <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ l_p2_tot_errors <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ l_p2_tot_hitpct <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ l_p2_tot_aces <dbl> 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 0, 1, 4, 0, 0, 1, 0, 0, 0, 0, 1, 2, 0,…
$ l_p2_tot_serve_errors <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ l_p2_tot_blocks <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 1, 0, 5, 1, 1, 2, 0, 0, 0,…
$ l_p2_tot_digs <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
This data is pretty clean and tidy, but we might want to play with a few things. I wanted to make separate datasets so I could look at individual players across all their matches, and at general information about the players and the matches.
# make a dataset with just information about the match
match_info <- vb_clean %>%
  # drop the player-level columns
  select(-contains("p1"), -contains("p2"), -contains("player")) %>%
  # split on ", " so the set scores do not carry a leading space
  separate(score, into = c("score_set1", "score_set2", "score_set3"), sep = ", ")
Expected 3 pieces. Missing pieces filled with `NA` in 52319 rows [1, 3, 4, 7, 8, 9, 10, 12, 15, 16, 17, 18, 21, 23, 24, 25, 26, 27, 28, 29, ...].
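That message is not an error: most matches finish in two sets, so there is no third piece to fill. A small sketch on a toy tibble (the two scores are copied from the glimpse() output; `toy_scores` and `toy_sets` are illustrative names) shows how `fill = "right"` tells `separate()` that the missing third set is expected.

```r
library(dplyr)
library(tidyr)

# Toy data: the first two scores shown in the glimpse() output
toy_scores <- tibble(score = c("21-18, 21-12", "21-16, 17-21, 15-10"))

toy_sets <- toy_scores %>%
  separate(
    score,
    into = c("score_set1", "score_set2", "score_set3"),
    sep = ", ",
    fill = "right"  # pad missing sets with NA on the right, without the warning
  )

toy_sets
```

The two-set match ends up with `NA` in `score_set3`, which is exactly what we want to count later.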
You might find the Cookbook for R graphics from the BBC helpful, as well as the resources in 6 Sigma Sunday #2 on using dplyr and ggplot.
What is the usual difference between scores in set 1 of a match?
What proportion of matches go to the third set?
Win rates by players?
Are players getting any taller?
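To give a feel for how the first two questions could be tackled, here is a sketch on four scores copied from the glimpse() output. It assumes the first number in each set is the match winner's points, and that every score is a plain "points-points" string; the real data may contain other values that need cleaning first. All object names are illustrative.

```r
library(dplyr)
library(tidyr)

# Toy data: four scores from the glimpse() output
toy_matches <- tibble(
  score = c("21-18, 21-12", "21-16, 17-21, 15-10", "21-18, 21-18", "21-16, 21-15")
)

set_scores <- toy_matches %>%
  # one column per set; two-set matches get NA for set 3
  separate(score, into = c("score_set1", "score_set2", "score_set3"),
           sep = ", ", fill = "right") %>%
  # split set 1 into two integer columns (assumed order: winner, then loser)
  separate(score_set1, into = c("w_set1", "l_set1"),
           sep = "-", convert = TRUE, remove = FALSE) %>%
  mutate(set1_diff = w_set1 - l_set1)

# proportion of matches that needed a third set
prop_third_set <- mean(!is.na(set_scores$score_set3))

# typical point difference in set 1
median_set1_diff <- median(set_scores$set1_diff)
```

On the full dataset you would swap `toy_matches` for your own match-level table and then plot `set1_diff` with a histogram.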
I’ve chosen to look at height, gender and age.
The average height of Canadian men is 5’ 10" (70 inches) and the average height of Canadian women is 5’ 4" (64 inches). Source: https://www.cbc.ca/news/health/height-growth-canada-1.3695398
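To compare players with those averages you need one row per player rather than one row per match. The sketch below is one possible way to do that, not necessarily how the dataset used later was built. It uses only the player 1 columns for two matches, with values copied from the glimpse() output; the player 2 columns would be handled the same way. `toy`, `players` and `players_by_gender` are illustrative names.

```r
library(dplyr)

# Toy data: the first two matches, player 1 columns only
toy <- tibble(
  gender         = c("M", "M"),
  w_player1      = c("Kevin Wong", "Brad Torsone"),
  w_p1_birthdate = as.Date(c("1972-09-12", "1975-01-14")),
  w_p1_hgt       = c(79, 78),
  l_player1      = c("Chuck Moore", "Mark Paaluhi"),
  l_p1_birthdate = as.Date(c("1973-08-18", "1971-03-08")),
  l_p1_hgt       = c(76, 75)
)

# Give the winner and loser columns the same names, then stack them
winners <- toy %>%
  select(gender, player = w_player1, birthdate = w_p1_birthdate, hgt = w_p1_hgt)
losers <- toy %>%
  select(gender, player = l_player1, birthdate = l_p1_birthdate, hgt = l_p1_hgt)

players <- bind_rows(winners, losers) %>%
  distinct(player, .keep_all = TRUE)  # keep one row per player

# mean height (inches) by gender, ignoring missing heights
players_by_gender <- players %>%
  group_by(gender) %>%
  summarise(mean_hgt = mean(hgt, na.rm = TRUE))
```

The heights in the data are in inches, so they can be compared directly with the 70 and 64 inch figures above.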
## model with gender interaction
summary(lm(hgt~birthdate*gender, data = winrate_filter))
Call:
lm(formula = hgt ~ birthdate * gender, data = winrate_filter)
Residuals:
Min 1Q Median 3Q Max
-10.958 -1.638 -0.057 1.815 9.846
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.476e+01 1.405e-01 532.202 < 2e-16 ***
birthdate 7.673e-05 2.209e-05 3.474 0.000518 ***
genderW -4.894e+00 2.089e-01 -23.422 < 2e-16 ***
birthdate:genderW -3.841e-05 3.121e-05 -1.230 0.218593
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.594 on 4287 degrees of freedom
Multiple R-squared: 0.4916, Adjusted R-squared: 0.4913
F-statistic: 1382 on 3 and 4287 DF, p-value: < 2.2e-16
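The birthdate coefficient looks tiny because, assuming `birthdate` is stored as a Date (as the birthdate columns in the glimpse() output are), R treats it as a number of days, so the slope is in inches per day. A quick back-of-the-envelope conversion, using the coefficients copied from the output above, turns it into inches per decade.

```r
# Coefficients copied from the summary() output
b_birthdate   <- 7.673e-05   # inches per day of birthdate, men (baseline level)
b_interaction <- -3.841e-05  # additional slope for women (genderW)

days_per_decade <- 365.25 * 10

men_per_decade   <- b_birthdate * days_per_decade
women_per_decade <- (b_birthdate + b_interaction) * days_per_decade

round(c(men = men_per_decade, women = women_per_decade), 2)
```

That works out to roughly 0.28 inches per decade for men and 0.14 for women, but the interaction term has a p-value of about 0.22, so this model does not give strong evidence that the trend differs between men and women.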
If you run ?ggsave, it will tell you that “ggsave() is a convenient function for saving a plot. It defaults to saving the last plot that you displayed, using the size of the current graphics device. It also guesses the type of graphics device from the extension.”
ggsave("vb_heights_birthyear_gender.png")
Saving 7 x 7 in image
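The "Saving 7 x 7 in image" message means ggsave() used the size of the current graphics device. If you want the same file every time, you can pass the plot and the size explicitly. A sketch with a toy plot (the data frame, plot and file name are illustrative; the heights and birthdates are copied from the glimpse() output):

```r
library(ggplot2)

# Toy data: four players from the glimpse() output
toy_players <- data.frame(
  birthdate = as.Date(c("1972-09-12", "1975-01-14", "1973-08-18", "1971-03-08")),
  hgt       = c(79, 78, 76, 75)
)

p <- ggplot(toy_players, aes(x = birthdate, y = hgt)) +
  geom_point() +
  labs(x = "Birthdate", y = "Height (inches)")

# Save a named plot at a fixed size instead of relying on the last plot shown
out_file <- file.path(tempdir(), "vb_heights_toy.png")
ggsave(out_file, plot = p, width = 8, height = 5, units = "in", dpi = 300)
```

In your own project you would replace `out_file` with a path inside your ISSC folder.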
BONUS BONUS!: Cowplot